The International Corpus of Arabic: Compilation, Analysis and Evaluation
نویسندگان
چکیده
This paper focuses on a project for building the first International Corpus of Arabic (ICA). It is planned to contain 100 million analyzed tokens with an interface which allows users to interact with the corpus data in a number of ways [ICA website]. ICA is a representative corpus of Arabic that has been initiated in 2006, it is intended to cover the Modern Standard Arabic (MSA) language as being used all over the Arab world. ICA has been analyzed by Bibliotheca Alexandrina Morphological Analysis Enhancer (BAMAE). BAMAE is based on Buckwalter Arabic Morphological Analyzer (BAMA). Precision and Recall are the evaluation measures used to evaluate the BAMAE system. At this point, Precision measurement ranges from 95%-92% while recall measurement was 92%-89%. This depends on the number of qualifiers retrieved for every word. The percentages are expected to rise by implementing the improvements while working on larger amounts of data.
منابع مشابه
Building an International Corpus of Arabic (ICA): Progress of Compilation Stage
This paper focuses on three axes. The first axis gives a survey of the importance of corpora in language studies e.g. lexicography, grammar, semantics, Natural Language Processing and other areas. The second axis demonstrates how the Arabic language lacks textual resources, such as corpora and tools for corpus analysis and the effected of this lack on the quality of Arabic language applications...
متن کاملروشی جدید جهت استخراج موجودیتهای اسمی در عربی کلاسیک
In Natural Language Processing (NLP) studies, developing resources and tools makes a contribution to extension and effectiveness of researches in each language. In recent years, Arabic Named Entity Recognition (ANER) has been considered by NLP researchers due to a significant impact on improving other NLP tasks such as Machine translation, Information retrieval, question answering, query result...
متن کاملA Conversation Analysis of Ellipsis and Substitution in Global Business English Textbooks
Despite the body of research on textbook evaluation from the discourse analysis perspective, cohesive devices have rarely been analyzed in English for Specific Purposes (ESP) textbooks. The acquisition and use of cohesive devices is inherent to naturalistic communication, including business interactions. Hence, L2 learners of business English should be exposed to these devices through cohesion-...
متن کاملArabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
متن کاملEncapsulation of Peppermint Oil with Arabic Gum-gelatin by Complex Coacervation Method
The gelatin/gum Arabic microcapsules encapsulating peppermint oil were prepared by complex coacervation using tannic acid as hardening agent. The effects of various parameters, including concentration of wall material, core material, tannic acid and tween80 were investigated on particle size and encapsulation efficiency. For statistical evaluation of the parameters, Taguchi method has been used...
متن کامل